In this exercise, we will be using functions from the tidyverse package. Before you use an R package, you need to load it into your session by calling the library function. It’s a good idea to load all of the packages you need in a single code chunk at the top of your R Markdown file, like this:

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

(a) Read in the Olympic 100m sprint medal results

The file Olympic_100m_results.csv contains information on all Olympic 100m sprint medalists. Open it in Excel first to see what it looks like, then read it into R using the read_csv function. You’ll need to give the data frame a name; choose olympic_100m_data if you want to be consistent with the rest of the exercise and the solutions we provide.

Hints:

  • If you’re using a Windows computer, you will need to close the file in Excel before you can open it in other software, including R.

  • If you get error messages about the file not being found, double check that you have typed the name correctly, and make sure your R working directory is set correctly. The RStudio menu option Session > Set Working Directory > To Source File Location is often useful.

  • If you get error messages about read_csv not being found, make sure you have loaded the tidyverse package by running the code chunks near the top of this file.

  • Make sure you are using the read_csv function and not the read.csv function, as the two do not produce the same output.
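
To see why the last hint matters, here is a minimal sketch comparing the two functions on a small made-up CSV file. The file contents and the names toy_file, base_version and tidy_version are invented for illustration; they are not part of the exercise data.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file (not the exercise data)
toy_file <- tempfile(fileext = ".csv")
writeLines(c("Name,Time", "Runner A,9.81", "Runner B,9.89"), toy_file)

# Base R: read.csv returns a plain data.frame
base_version <- read.csv(toy_file)
class(base_version)

# readr (loaded as part of the tidyverse): read_csv returns a tibble
tidy_version <- read_csv(toy_file, show_col_types = FALSE)
class(tidy_version)
```

Both objects hold the same values, but the tibble prints in the compact form, with column types, that you will see in the rest of this exercise.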

olympic_100m_data <- read_csv("Olympic_100m_results.csv")
Rows: 138 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Gender, Event, Location, Medal, Name, Nationality
dbl (3): Year, Result, Time
lgl (1): Wind

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

(b) Looking at a data frame using RStudio

When you load the data file, you should see its name appear in the Environment tab in RStudio (normally in the top-right panel).

How many observations and how many variables does this data frame have?

What happens when you click the blue arrow to the left of the data frame? What do you think chr, num and logi mean?

What happens when you click on the name of the data frame (e.g. olympic_100m_data)?

What do you think the data in each column represents?

You can see from the list in the top right that there are 138 observations (rows) and 10 variables (columns).

The blue arrow reveals a list of the columns in the data set and their data types (chr for character, i.e. text; num for numeric; logi for logical, i.e. TRUE/FALSE values).

Clicking on the data frame name shows a spreadsheet-style view of the data in the top-left panel.
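
If you would like to see the three data types side by side, the sketch below builds a small made-up data frame (toy_data and its columns are hypothetical, not part of the exercise) and asks R to describe it. The str function prints a summary that should look similar to what the blue arrow reveals.

```r
# A small made-up data frame with one column of each type
toy_data <- data.frame(
  Name   = c("Runner A", "Runner B", "Runner C"),  # character (chr)
  Time   = c(9.81, 9.89, 9.91),                    # numeric (num)
  Winner = c(TRUE, FALSE, FALSE),                  # logical (logi)
  stringsAsFactors = FALSE                         # keep text as character in older R versions
)

# str() lists each column with its type and first few values
str(toy_data)

# class() reports the type of a single column
class(toy_data$Name)
class(toy_data$Time)
class(toy_data$Winner)
```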

(c) Looking at a data frame using code

Type the name of the data frame (e.g. olympic_100m_data) into the R console on the bottom-left and press enter. What do you see?

Type glimpse(NAME_OF_DATA_FRAME) into the R console (changing that to the name of the data frame you chose earlier). What do you see?

Now add these statements to an R code chunk. Try running them from the R Markdown document (Ctrl-Enter/Command-Enter or the green ‘play’ button on the right). Finally, knit the document into an HTML file and look at the result.

olympic_100m_data
# A tibble: 138 × 10
   Gender Event    Location   Year Medal Name         Natio…¹ Result Wind   Time
   <chr>  <chr>    <chr>     <dbl> <chr> <chr>        <chr>    <dbl> <lgl> <dbl>
 1 M      100M Men Rio        2016 G     Usain BOLT   JAM       9.81 NA     9.81
 2 M      100M Men Rio        2016 S     Justin GATL… USA       9.89 NA     9.89
 3 M      100M Men Rio        2016 B     Andre DE GR… CAN       9.91 NA     9.91
 4 M      100M Men Beijing    2008 G     Usain BOLT   JAM       9.69 NA     9.69
 5 M      100M Men Beijing    2008 S     Richard THO… TTO       9.89 NA     9.89
 6 M      100M Men Beijing    2008 B     Walter DIX   USA       9.91 NA     9.91
 7 M      100M Men Sydney     2000 G     Maurice GRE… USA       9.87 NA     9.87
 8 M      100M Men Sydney     2000 S     Ato BOLDON   TTO       9.99 NA     9.99
 9 M      100M Men Sydney     2000 B     Obadele THO… BAR      10.0  NA    10.0 
10 M      100M Men Barcelona  1992 G     Linford CHR… GBR       9.96 NA     9.96
# … with 128 more rows, and abbreviated variable name ¹​Nationality
glimpse(olympic_100m_data)
Rows: 138
Columns: 10
$ Gender      <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M"…
$ Event       <chr> "100M Men", "100M Men", "100M Men", "100M Men", "100M Men"…
$ Location    <chr> "Rio", "Rio", "Rio", "Beijing", "Beijing", "Beijing", "Syd…
$ Year        <dbl> 2016, 2016, 2016, 2008, 2008, 2008, 2000, 2000, 2000, 1992…
$ Medal       <chr> "G", "S", "B", "G", "S", "B", "G", "S", "B", "G", "S", "B"…
$ Name        <chr> "Usain BOLT", "Justin GATLIN", "Andre DE GRASSE", "Usain B…
$ Nationality <chr> "JAM", "USA", "CAN", "JAM", "TTO", "USA", "USA", "TTO", "B…
$ Result      <dbl> 9.81, 9.89, 9.91, 9.69, 9.89, 9.91, 9.87, 9.99, 10.04, 9.9…
$ Wind        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Time        <dbl> 9.81, 9.89, 9.91, 9.69, 9.89, 9.91, 9.87, 9.99, 10.04, 9.9…

(d) Looking at columns of numeric data

R refers to an individual column within a data frame using the notation DATA_FRAME_NAME$COLUMN_NAME. For example, the column Time in the data frame olympic_100m_data would be written olympic_100m_data$Time.

  • What happens when you write a line of code with just the name of a column within a data frame, e.g. olympic_100m_data$Time?

  • What happens if you misspell the name of the column? What about using uppercase and lowercase differently from the name in the data frame?

  • There are a number of functions built into R which operate on numeric vectors. Some useful ones include mean, sd, min, max and median. Try calling these on the Time column; e.g. mean(olympic_100m_data$Time)

R can also do arithmetic calculations, like a pocket calculator or Excel formula. You can mix single numbers with vectors (columns of numbers) and R will usually do something sensible. For example, 2 * column_name will return a vector where each element of the original column has been doubled, while column1 + column2 will add corresponding elements of each column.
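
As a small illustration of the two cases just described, the sketch below uses short made-up vectors in place of real columns (column1, column2, doubled and summed are placeholder names).

```r
# Two short made-up numeric vectors, standing in for two columns
column1 <- c(1, 2, 3)
column2 <- c(10, 20, 30)

# A single number combined with a vector: applied to every element
doubled <- 2 * column1
doubled   # 2 4 6

# Two vectors of the same length: combined element by element
summed <- column1 + column2
summed    # 11 22 33
```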

  • Calculate the speed of each runner in metres per second. (Hint: speed is distance / time, and this data frame only contains information about a 100m race.)

  • What is the fastest speed in metres per second? (Hint: use the max function.)

olympic_100m_data$Time
  [1]  9.81  9.89  9.91  9.69  9.89  9.91  9.87  9.99 10.04  9.96 10.02 10.04
 [13]  9.99 10.19 10.22 10.06 10.08 10.14  9.90 10.00 10.00 10.20 10.20 10.30
 [25] 10.40 10.40 10.40 10.30 10.40 10.50 10.80 10.90 10.90 10.80 10.80 10.90
 [37] 10.80 11.00 11.10 11.20  9.63  9.75  9.79  9.85  9.86  9.87  9.84  9.89
 [49]  9.90 10.25 10.25 10.39 10.14 10.24 10.33 10.00 10.20 10.20 10.50 10.50
 [61] 10.60 10.30 10.40 10.60 10.30 10.30 10.40 10.60 10.70 10.80 10.80 10.90
 [73] 10.90 11.00 11.20 11.20 12.00 12.20 12.60 12.60 10.71 10.83 10.86 10.78
 [85] 10.98 10.98 11.12 11.18 11.19 10.82 10.83 10.84 10.97 11.13 11.16 11.08
 [97] 11.13 11.17 11.00 11.10 11.10 11.00 11.30 11.30 11.50 11.80 11.90 11.50
[109] 11.70 11.90 12.20 10.75 10.78 10.81 10.93 10.96 10.97 10.94 10.94 10.96
[121] 11.06 11.07 11.14 11.07 11.23 11.24 11.40 11.60 11.60 11.50 11.70 11.70
[133] 11.90 12.20 12.20 11.90 11.90 12.00
olympic_100m_data$Tiem
Warning: Unknown or uninitialised column: `Tiem`.
NULL
olympic_100m_data$time
Warning: Unknown or uninitialised column: `time`.
NULL
mean(olympic_100m_data$Time)
[1] 10.77674
sd(olympic_100m_data$Time)
[1] 0.6761183
min(olympic_100m_data$Time)
[1] 9.63
max(olympic_100m_data$Time)
[1] 12.6
median(olympic_100m_data$Time)
[1] 10.805
100 / olympic_100m_data$Time
  [1] 10.193680 10.111223 10.090817 10.319917 10.111223 10.090817 10.131712
  [8] 10.010010  9.960159 10.040161  9.980040  9.960159 10.010010  9.813543
 [15]  9.784736  9.940358  9.920635  9.861933 10.101010 10.000000 10.000000
 [22]  9.803922  9.803922  9.708738  9.615385  9.615385  9.615385  9.708738
 [29]  9.615385  9.523810  9.259259  9.174312  9.174312  9.259259  9.259259
 [36]  9.174312  9.259259  9.090909  9.009009  8.928571 10.384216 10.256410
 [43] 10.214505 10.152284 10.141988 10.131712 10.162602 10.111223 10.101010
 [50]  9.756098  9.756098  9.624639  9.861933  9.765625  9.680542 10.000000
 [57]  9.803922  9.803922  9.523810  9.523810  9.433962  9.708738  9.615385
 [64]  9.433962  9.708738  9.708738  9.615385  9.433962  9.345794  9.259259
 [71]  9.259259  9.174312  9.174312  9.090909  8.928571  8.928571  8.333333
 [78]  8.196721  7.936508  7.936508  9.337068  9.233610  9.208103  9.276438
 [85]  9.107468  9.107468  8.992806  8.944544  8.936550  9.242144  9.233610
 [92]  9.225092  9.115770  8.984726  8.960573  9.025271  8.984726  8.952551
 [99]  9.090909  9.009009  9.009009  9.090909  8.849558  8.849558  8.695652
[106]  8.474576  8.403361  8.695652  8.547009  8.403361  8.196721  9.302326
[113]  9.276438  9.250694  9.149131  9.124088  9.115770  9.140768  9.140768
[120]  9.124088  9.041591  9.033424  8.976661  9.033424  8.904720  8.896797
[127]  8.771930  8.620690  8.620690  8.695652  8.547009  8.547009  8.403361
[134]  8.196721  8.196721  8.403361  8.403361  8.333333
max(100 / olympic_100m_data$Time)
[1] 10.38422
# or:
100 / min(olympic_100m_data$Time)
[1] 10.38422

(e) Looking at columns of categorical data

The Event and Medal columns in this data frame are categorical.

  • How does R display these columns if you type in their name, e.g. olympic_100m_data$Event?

  • You can get a simple frequency table using the table() function. Try calling it on the Event and Medal columns, e.g. table(olympic_100m_data$Event)
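
Before trying table() on the real columns, it may help to see it on a very short made-up vector, where the counts are easy to check by eye. The names toy_medals and medal_counts are invented for this sketch.

```r
# A short made-up vector of medal codes
toy_medals <- c("G", "S", "B", "G", "G", "S")

# Count how many times each distinct value appears
medal_counts <- table(toy_medals)
medal_counts

# Individual counts can be looked up by name
medal_counts[["G"]]   # 3
```

Note that table() lists the distinct values in alphabetical order (B, G, S) rather than in the order they first appear.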

olympic_100m_data$Event
  [1] "100M Men"   "100M Men"   "100M Men"   "100M Men"   "100M Men"  
  [6] "100M Men"   "100M Men"   "100M Men"   "100M Men"   "100M Men"  
 [11] "100M Men"   "100M Men"   "100M Men"   "100M Men"   "100M Men"  
 [16] "100M Men"   "100M Men"   "100M Men"   "100M Men"   "100M Men"  
 [21] "100M Men"   "100M Men"   "100M Men"   "100M Men"   "100M Men"  
 [26] "100M Men"   "100M Men"   "100M Men"   "100M Men"   "100M Men"  
 [31] "100M Men"   "100M Men"   "100M Men"   "100M Men"   "100M Men"  
 [36] "100M Men"   "100M Men"   "100M Men"   "100M Men"   "100M Men"  
 [41] "100M Men"   "100M Men"   "100M Men"   "100M Men"   "100M Men"  
 [46] "100M Men"   "100M Men"   "100M Men"   "100M Men"   "100M Men"  
 [51] "100M Men"   "100M Men"   "100M Men"   "100M Men"   "100M Men"  
 [56] "100M Men"   "100M Men"   "100M Men"   "100M Men"   "100M Men"  
 [61] "100M Men"   "100M Men"   "100M Men"   "100M Men"   "100M Men"  
 [66] "100M Men"   "100M Men"   "100M Men"   "100M Men"   "100M Men"  
 [71] "100M Men"   "100M Men"   "100M Men"   "100M Men"   "100M Men"  
 [76] "100M Men"   "100M Men"   "100M Men"   "100M Men"   "100M Men"  
 [81] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
 [86] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
 [91] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
 [96] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
[101] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
[106] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
[111] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
[116] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
[121] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
[126] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
[131] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
[136] "100M Women" "100M Women" "100M Women"
table(olympic_100m_data$Event)

  100M Men 100M Women 
        80         58 
olympic_100m_data$Medal
  [1] "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B"
 [19] "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B"
 [37] "G" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S"
 [55] "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S"
 [73] "B" "G" "S" "B" "G" "S" "B" "B" "G" "S" "B" "G" "S" "S" "S" "S" "B" "G"
 [91] "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G"
[109] "S" "B" "G" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B"
[127] "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B"
table(olympic_100m_data$Medal)

 B  G  S 
45 46 47 

(f) Getting help within R

There are a few ways to get help within R. Try these out:

  1. Click on the name of a function in a code chunk, and press F1.
  2. Type ?functionname in the R console pane (bottom left - don’t add this to your R Markdown file), e.g. ?table
  3. Type help("functionname") in the R console. e.g. help("mean")
  4. Type help(package = "packagename") in the R console to get help on an R package, e.g. help(package = "ggplot2")

The first three all do the same thing.

What do you notice about the documentation which comes with R?

The R documentation is terse and nerdy but comprehensive. It is okay for a reference but not very helpful when you’re first learning R.

R packages often contain a lot of different functions, and looking at the list of functions in a package is often unhelpful. There is a link “user guides, package vignettes and other documentation” in small font near the top which sometimes has more helpful help!

(g) Clearing your R session

There are two common ways to restore your R session to a fresh start.

  1. Clear all variables in your workspace: click the ‘broom’ icon in the Environment panel (top-right) or Session > Clear Workspace in the menu.

  2. Restart your R session: Session > Restart R in the menu.

Try both of these. What do you think the difference is? (Hint: try running the code chunk with read_csv after each of them.)

Restarting the R session will also clear any loaded packages (e.g. tidyverse) and reset your session’s working directory.
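
The first option can also be approximated in code. The sketch below assumes the ‘broom’ icon behaves much like rm(list = ls()), which is a common description of it but not something this exercise confirms; toy_value is a made-up object. It shows that clearing the workspace removes your objects but leaves attached packages in place.

```r
# Create an object in the workspace
toy_value <- 42
exists("toy_value")   # TRUE

# Remove every (non-hidden) object from the workspace
rm(list = ls())
exists("toy_value")   # FALSE

# Attached packages are not affected by rm(); they still appear here
search()
```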


© 2022 Statistical Consulting Centre, The University of Melbourne.